Exponential Space Improvement for minwise Based Algorithms
نویسندگان
چکیده
In this paper we introduce a general framework that exponentially improves the space, the degree of independence, and the time needed by min-wise based algorithms. The authors, in SODA11, [15] introduced an exponential time improvement for min-wise based algorithms by defining and constructing an almost k-min-wise independent family of hash functions. Here we develop an alternative approach that achieves both exponential time and exponential space improvement. The new approach relaxes the need for approximately min-wise hash functions, hence gets around the Ω(log 1 ) independence lower bound in [23]. This is done by defining and constructing a d-k-min-wise independent family of hash functions. Surprisingly, for most cases only 8-wise independence is needed for the additional improvement. Moreover, as the degree of independence is a small constant, our function can be implemented efficiently. Informally, under this definition, all subsets of size d of any fixed set X have an equal probability to have hash values among the minimal k values in X, where the probability is over the random choice of hash function from the family. This property measures the randomness of the family, as choosing a truly random function, obviously, satisfies the definition for d = k = |X|. We define and give an efficient time and space construction of approximately d-k-min-wise independent family of hash functions for the case where d = 2, as this is sufficient for the additional exponential improvement. We discuss how this construction can be used to improve many minwise based algorithms. To our knowledge such definitions, for hash functions, were never studied and no construction was given before. As an example we show how to apply it for similarity and rarity estimation over data streams. Other min-wise based algorithms, can be adjusted in the same way. 1998 ACM Subject Classification F.1.2 Modes of Computation Online computation
منابع مشابه
Optimal Densification for Fast and Accurate Minwise Hashing
Minwise hashing is a fundamental and one of the most successful hashing algorithm in the literature. Recent advances based on the idea of densification (Shrivastava & Li, 2014a;c) have shown that it is possible to compute k minwise hashes, of a vector with d nonzeros, in mere (d + k) computations, a significant improvement over the classical O(dk). These advances have led to an algorithmic impr...
متن کاملb-Bit Minwise Hashing for Large-Scale Learning
Abstract Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and ...
متن کاملb-Bit Minwise Hashing for Estimating Three-Way Similarities
Computing1 two-way and multi-way set similarities is a fundamental problem. This study focuses on estimating 3-way resemblance (Jaccard similarity) using b-bit minwise hashing. While traditional minwise hashing methods store each hashed value using 64 bits, b-bit minwise hashing only stores the lowest b bits (where b ≥ 2 for 3-way). The extension to 3-way similarity from the prior work on 2-way...
متن کاملApproximately Minwise Independence with Twisted Tabulation
A random hash function h is ε-minwise if for any set S, |S| “ n, and element x P S, Prrhpxq “ minhpSqs “ p1 ̆ εq{n. Minwise hash functions with low bias ε have widespread applications within similarity estimation. Hashing from a universe rus, the twisted tabulation hashing of Pǎtraşcu and Thorup [SODA’13] makes c “ Op1q lookups in tables of size u1{c. Twisted tabulation was invented to get good ...
متن کاملAsymmetric Minwise Hashing
Minwise hashing (Minhash) is a widely popular indexing scheme in practice. Minhash is designed for estimating set resemblance and is known to be suboptimal in many applications where the desired measure is set overlap (i.e., inner product between binary vectors) or set containment. Minhash has inherent bias towards smaller sets, which adversely affects its performance in applications where such...
متن کامل